Activation Oracles

mentions: 1, type: Person

// recent coverage 1 mention

02:54

2026-06-24

lesswrong.com

natural-language-processing

Can You Hide From a Natural Language Autoencoder?

Researchers stress-tested Natural Language Autoencoders (NLAs) by optimizing activation vectors to flip Activation Verbalizer (AV) explanations while preserving model behavior, achieving an 81.4% flip rate with 99.6% label p…

// co-occurs with top 5 entities

Qwen (1), Anthropic (1), Qwen2.5-7B-Instruct (1), Activation Verbalizer (1), Neural Chameleons (1)